Abstract


An attentional mechanism has lately been used to improve neural machine translation (NMT) by selectively focusing on parts of the source sentence during translation. However, there has been little work exploring useful architectures for attention-based NMT. This paper examines two simple and effective classes of attentional mechanism: a global approach which always attends to all source words and a local one that only looks at a subset of source words at a time. We demonstrate the effectiveness of both approaches on the WMT translation tasks between English and German in both directions. With local attention, we achieve a significant gain of 5.0 BLEU points over non-attentional systems that already incorporate known techniques such as dropout. Our ensemble model using different attention architectures yields a new state-of-the-art result in the WMT’15 English to German translation task with 25.9 BLEU points, an improvement of 1.0 BLEU points over the existing best system backed by NMT and an n-gram reranker.

1 Introduction 


Neural Machine Translation (NMT) achieved state-of-the-art performances in large-scale translation tasks such as from English to French (Luong et al., 2015) and English to German (Jean et al., 2015). NMT is appealing since it requires minimal domain knowledge and is conceptually simple. The model by Luong et al. (2015) reads through all the source words until the end-of-sentence symbol <eos> is reached. It then starts emitting one target word at a time, as illustrated in Figure 1. NMT is often a large neural network that is trained in an end-to-end fashion and has the ability to generalize well to very long word sequences. This means the model does not have to explicitly store gigantic phrase tables and language models as in the case of standard MT; hence, NMT has a small memory footprint. Lastly, implementing NMT decoders is easy, unlike the highly intricate decoders in standard MT (Koehn et al., 2003).

All our code and models are publicly available at http://nlp.stanford.edu/projects/nmt

Figure 1: Neural machine translation - a stacking recurrent architecture for translating a source sequence A B C D into a target sequence X Y Z. Here, <eos> marks the end of a sentence.

In parallel, the concept of “attention” has gained popularity recently in training neural networks, allowing models to learn alignments between different modalities, e.g., between image objects and agent actions in the dynamic control problem (Mnih et al., 2014), between speech frames and text in the speech recognition task (?), or between visual features of a picture and its text description in the image caption generation task (Xu et al., 2015). In the context of NMT, Bahdanau et al. (2015) have successfully applied such an attentional mechanism to jointly translate and align words. To the best of our knowledge, there has not been any other work exploring the use of attention-based architectures for NMT.

In this work, we design, with simplicity and effectiveness in mind, two novel types of attention-based models: a global approach in which all source words are attended and a local one whereby only a subset of source words are considered at a time. The former approach resembles the model of (Bahdanau et al., 2015) but is simpler architecturally. The latter can be viewed as an interesting blend between the hard and soft attention models proposed in (Xu et al., 2015): it is computationally less expensive than the global model or the soft attention; at the same time, unlike the hard attention, the local attention is differentiable almost everywhere, making it easier to implement and train. Besides, we also examine various alignment functions for our attention-based models.

Experimentally, we demonstrate that both of our approaches are effective in the WMT translation tasks between English and German in both directions. Our attentional models yield a boost of up to 5.0 BLEU over non-attentional systems which already incorporate known techniques such as dropout. For English to German translation, we achieve new state-of-the-art (SOTA) results for both WMT’14 and WMT’15, outperforming previous SOTA systems, backed by NMT models and n-gram LM rerankers, by more than 1.0 BLEU. We conduct extensive analysis to evaluate our models in terms of learning, the ability to handle long sentences, choices of attentional architectures, alignment quality, and translation outputs.


2 Neural Machine Translation

A neural machine translation system is a neural network that directly models the conditional probability $p(y|x)$ of translating a source sentence, $x_1, \ldots, x_n$, to a target sentence, $y_1, \ldots, y_m$. A basic form of NMT consists of two components: (a) an encoder which computes a representation $s$ for each source sentence and (b) a decoder which generates one target word at a time and hence decomposes the conditional probability as:

$$\log p(y|x) = \sum_{j=1}^{m} \log p(y_j | y_{<j}, s) \tag{1}$$

A natural choice to model such a decomposition in the decoder is to use a recurrent neural network (RNN) architecture, which most of the recent NMT work such as (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015; Luong et al., 2015; Jean et al., 2015) have in common. They, however, differ in terms of which RNN architectures are used for the decoder and how the encoder computes the source sentence representation $s$.

Kalchbrenner and Blunsom (2013) used an RNN with the standard hidden unit for the decoder and a convolutional neural network for encoding the source sentence representation. On the other hand, both Sutskever et al. (2014) and Luong et al. (2015) stacked multiple layers of an RNN with a Long Short-Term Memory (LSTM) hidden unit for both the encoder and the decoder. Cho et al. (2014), Bahdanau et al. (2015), and Jean et al. (2015) all adopted a different version of the RNN with an LSTM-inspired hidden unit, the gated recurrent unit (GRU), for both components.

In more detail, one can parameterize the probability of decoding each word $y_j$ as:

$$p(y_j | y_{<j}, s) = \mathrm{softmax}(g(h_j)) \tag{2}$$

with $g$ being the transformation function that outputs a vocabulary-sized vector. Here, $h_j$ is the RNN hidden unit, abstractly computed as:

$$h_j = f(h_{j-1}, s) \tag{3}$$

where $f$ computes the current hidden state given the previous hidden state and can be either a vanilla RNN unit, a GRU, or an LSTM unit. In (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014; Luong et al., 2015), the source representation $s$ is only used once to initialize the decoder hidden state. On the other hand, in (Bahdanau et al., 2015; Jean et al., 2015) and this work, $s$, in fact, implies a set of source hidden states which are consulted throughout the entire course of the translation process. Such an approach is referred to as an attention mechanism, which we will discuss next.

In this work, following (Sutskever et al., 2014; Luong et al., 2015), we use the stacking LSTM architecture for our NMT systems, as illustrated in Figure 1. We use the LSTM unit defined in (Zaremba et al., 2015). Our training objective is formulated as follows:

$$J_t = \sum_{(x,y) \in D} -\log p(y|x) \tag{4}$$

with $D$ being our parallel training corpus.

There is a recent work by Gregor et al. (2015), which is very similar to our local attention and applied to the image generation task. However, as we detail later, our model is much simpler and can achieve good performance for NMT.

All sentences are assumed to terminate with a special “end-of-sentence” token <eos>.

They all used a single RNN layer except for the latter two works which utilized a bidirectional RNN for the encoder.

One can provide $g$ with other inputs such as the currently predicted word $y_j$ as in (Bahdanau et al., 2015).
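The per-word decomposition of Eq. (1) and the corpus-level objective can be illustrated with a small sketch. This is not the authors' implementation: the function names are hypothetical, plain Python lists stand in for the vocabulary-sized score vectors $g(h_j)$, and the objective is assumed to be the summed negative log-likelihood over the sentence pairs in $D$.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of real-valued scores."""
    peak = max(scores)
    exps = [math.exp(v - peak) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sentence_log_prob(step_scores, target_ids):
    """log p(y|x) as the sum over j of log p(y_j | y_<j, s), as in Eq. (1).

    step_scores[j] stands in for g(h_j): one score per vocabulary word.
    target_ids[j] is the vocabulary index of the reference word y_j.
    """
    return sum(math.log(softmax(scores)[y])
               for scores, y in zip(step_scores, target_ids))

def corpus_objective(corpus):
    """Negative log-likelihood summed over all (scores, targets) pairs in D."""
    return sum(-sentence_log_prob(scores, ids) for scores, ids in corpus)

# Toy check: uniform scores over a 4-word vocabulary, 3 target words,
# so each word has probability 1/4 and log p(y|x) equals 3 * log(1/4).
uniform = [[0.0] * 4 for _ in range(3)]
print(sentence_log_prob(uniform, [0, 1, 2]))
```

In a real system the scores would come from the decoder network and the objective would be minimized by gradient descent; the sketch only shows how the sentence probability factorizes.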

3 Attention-based Models 

Our various attention-based models are classified into two broad categories, global and local. These classes differ in terms of whether the “attention” is placed on all source positions or on only a few source positions. We illustrate these two model types in Figures 2 and 3 respectively.

Common to these two types of models is the fact that at each time step $t$ in the decoding phase, both approaches first take as input the hidden state $h_t$ at the top layer of a stacking LSTM. The goal is then to derive a context vector $c_t$ that captures relevant source-side information to help predict the current target word $y_t$. While these models differ in how the context vector $c_t$ is derived, they share the same subsequent steps.

Specifically, given the target hidden state $h_t$ and the source-side context vector $c_t$, we employ a simple concatenation layer to combine the information from both vectors to produce an attentional hidden state as follows:

$$\tilde{h}_t = \tanh(W_c [c_t; h_t]) \tag{5}$$

The attentional vector $\tilde{h}_t$ is then fed through the softmax layer to produce the predictive distribution formulated as:

$$p(y_t | y_{<t}, x) = \mathrm{softmax}(W_s \tilde{h}_t) \tag{6}$$

We now detail how each model type computes the source-side context vector $c_t$.

Figure 2: Global attentional model - at each time step $t$, the model infers a variable-length alignment weight vector $a_t$ based on the current target state $h_t$ and all source states $\bar{h}_s$. A global context vector $c_t$ is then computed as the weighted average, according to $a_t$, over all the source states.

3.1 Global Attention

The idea of a global attentional model is to consider all the hidden states of the encoder when deriving the context vector $c_t$. In this model type, a variable-length alignment vector $a_t$, whose size equals the number of time steps on the source side, is derived by comparing the current target hidden state $h_t$ with each source hidden state $\bar{h}_s$:

$$a_t(s) = \mathrm{align}(h_t, \bar{h}_s) = \frac{\exp(\mathrm{score}(h_t, \bar{h}_s))}{\sum_{s'} \exp(\mathrm{score}(h_t, \bar{h}_{s'}))} \tag{7}$$

Here, score is referred to as a content-based function for which we consider three different alternatives:

$$\mathrm{score}(h_t, \bar{h}_s) = \begin{cases} h_t^\top \bar{h}_s & \text{dot} \\ h_t^\top W_a \bar{h}_s & \text{general} \\ v_a^\top \tanh(W_a [h_t; \bar{h}_s]) & \text{concat} \end{cases}$$

Besides, in our early attempts to build attention-based models, we use a location-based function in which the alignment scores are computed solely from the target hidden state $h_t$ as follows:

$$a_t = \mathrm{softmax}(W_a h_t) \quad \text{location} \tag{8}$$

Given the alignment vector as weights, the context vector $c_t$ is computed as the weighted average over all the source hidden states.
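One global attention step, from alignment scores to the attentional hidden state of Eq. (5), can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: vectors are plain Python lists, the helper names (`score_dot`, `score_general`, `global_attention`, `attentional_state`) are hypothetical, and only the dot and general score functions are shown.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(v - peak) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, v) for row in W]

def score_dot(h_t, h_s):
    """dot score: h_t^T h_s."""
    return dot(h_t, h_s)

def score_general(h_t, h_s, W_a):
    """general score: h_t^T W_a h_s."""
    return dot(h_t, matvec(W_a, h_s))

def global_attention(h_t, source_states, score):
    """Return (a_t, c_t): alignment weights over all source states (Eq. 7)
    and the context vector as their weighted average."""
    a_t = softmax([score(h_t, h_s) for h_s in source_states])
    dim = len(source_states[0])
    c_t = [sum(a * h_s[k] for a, h_s in zip(a_t, source_states))
           for k in range(dim)]
    return a_t, c_t

def attentional_state(c_t, h_t, W_c):
    """Eq. (5): tanh(W_c [c_t; h_t]), with [;] as list concatenation."""
    return [math.tanh(v) for v in matvec(W_c, c_t + h_t)]

# Toy usage: three 2-dimensional source states and one target state.
h_t = [1.0, 0.0]
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
a_t, c_t = global_attention(h_t, src, score_dot)
W_c = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]  # illustrative weights
h_tilde = attentional_state(c_t, h_t, W_c)
```

To use the general score, pass a closure such as `lambda h, s: score_general(h, s, W_a)` as the `score` argument.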

Comparison to (Bahdanau et al., 2015) - While our global attention approach is similar in spirit to the model proposed by Bahdanau et al. (2015), there are several key differences which reflect how we have both simplified and generalized from the original model. First, we simply use hidden states at the top LSTM layers in both the encoder and decoder as illustrated in Figure 2. Bahdanau et al. (2015), on the other hand, use the concatenation of the forward and backward source hidden states in the bi-directional encoder and target hidden states in their non-stacking uni-directional decoder. Second, our computation path is simpler; we go from $h_t \to a_t \to c_t \to \tilde{h}_t$ then make a prediction as detailed in Eq. (5), Eq. (6), and Figure 2. On the other hand, at any time $t$, Bahdanau et al. (2015) build from the previous hidden state $h_{t-1} \to a_t \to c_t \to h_t$, which, in turn, goes through a deep-output and a maxout layer before making predictions. Lastly, Bahdanau et al. (2015) only experimented with one alignment function, the concat product; whereas we show later that the other alternatives are better.

Eq. (8) implies that all alignment vectors $a_t$ are of the same length. For short sentences, we only use the top part of $a_t$ and for long sentences, we ignore words near the end.

Figure 3: Local attention model - the model first predicts a single aligned position $p_t$ for the current target word. A window centered around the source position $p_t$ is then used to compute a context vector $c_t$, a weighted average of the source hidden states in the window. The weights $a_t$ are inferred from the current target state $h_t$ and those source states $\bar{h}_s$ in the window.


3.2 Local Attention 

The global attention has a drawback that it has to attend to all words on the source side for each target word, which is expensive and can potentially render it impractical to translate longer sequences, e.g., paragraphs or documents. To address this deficiency, we propose a local attentional mechanism that chooses to focus only on a small subset of the source positions per target word.

This model takes inspiration from the tradeoff between the soft and hard attentional models proposed by Xu et al. (2015) to tackle the image caption generation task. In their work, soft attention refers to the global attention approach in which weights are placed “softly” over all patches in the source image. The hard attention, on the other hand, selects one patch of the image to attend to at a time. While less expensive at inference time, the hard attention model is non-differentiable and requires more complicated techniques such as variance reduction or reinforcement learning to train.

We will refer to this difference again in Section 3.3.

Our local attention mechanism selectively focuses on a small window of context and is differentiable. This approach has the advantage of avoiding the expensive computation incurred in the soft attention and, at the same time, is easier to train than the hard attention approach. In concrete details, the model first generates an aligned position $p_t$ for each target word at time $t$. The context vector $c_t$ is then derived as a weighted average over the set of source hidden states within the window $[p_t - D, p_t + D]$; $D$ is empirically selected. Unlike the global approach, the local alignment vector $a_t$ is now fixed-dimensional, i.e., $a_t \in \mathbb{R}^{2D+1}$. We consider two variants of the model as below.

Monotonic alignment (local-m) - we simply set $p_t = t$, assuming that source and target sequences are roughly monotonically aligned. The alignment vector $a_t$ is defined according to Eq. (7).

Predictive alignment (local-p) - instead of assuming monotonic alignments, our model predicts an aligned position as follows:

$$p_t = S \cdot \mathrm{sigmoid}(v_p^\top \tanh(W_p h_t)) \tag{9}$$

$W_p$ and $v_p$ are the model parameters which will be learned to predict positions. $S$ is the source sentence length. As a result of sigmoid, $p_t \in [0, S]$. To favor alignment points near $p_t$, we place a Gaussian distribution centered around $p_t$. Specifically, our alignment weights are now defined as:

$$a_t(s) = \mathrm{align}(h_t, \bar{h}_s) \exp\left(-\frac{(s - p_t)^2}{2\sigma^2}\right) \tag{10}$$

We use the same align function as in Eq. (7) and the standard deviation is empirically set as $\sigma = D/2$. Note that $p_t$ is a real number; whereas $s$ is an integer within the window centered at $p_t$.

If the window crosses the sentence boundaries, we simply ignore the outside part and consider words in the window.

local-m is the same as the global model except that the vector $a_t$ is fixed-length and shorter.

local-p is similar to the local-m model except that we dynamically compute $p_t$ and use a truncated Gaussian distribution to modify the original alignment weights $\mathrm{align}(h_t, \bar{h}_s)$ as shown in Eq. (10). By utilizing $p_t$ to derive $a_t$, we can compute backprop gradients for $W_p$ and $v_p$. This model is differentiable almost everywhere.
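The predictive alignment of Eq. (9) and the Gaussian-weighted window can be sketched as below. Several choices here are assumptions made for illustration and are not confirmed by the text: source positions are 0-based, the align softmax is normalized over the window only, the window is clipped at the sentence boundaries, and the function names are hypothetical. The standard deviation is taken as half the window radius $D$.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(v - peak) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_position(h_t, W_p, v_p, S):
    """Eq. (9): p_t = S * sigmoid(v_p^T tanh(W_p h_t)), a real number in [0, S]."""
    hidden = [math.tanh(sum(w * h for w, h in zip(row, h_t))) for row in W_p]
    z = sum(v * x for v, x in zip(v_p, hidden))
    return S / (1.0 + math.exp(-z))

def local_p_weights(scores, p_t, D):
    """Alignment weights on the integer positions in [p_t - D, p_t + D].

    scores[s] is score(h_t, h_s) for every source position s.
    Returns a dict {s: a_t(s)}: softmax over the window, scaled by a
    Gaussian centered at p_t with sigma = D / 2.
    """
    sigma = D / 2.0
    lo = max(0, math.ceil(p_t - D))                  # clip at sentence start
    hi = min(len(scores) - 1, math.floor(p_t + D))   # clip at sentence end
    window = list(range(lo, hi + 1))
    align = softmax([scores[s] for s in window])
    return {s: a * math.exp(-((s - p_t) ** 2) / (2.0 * sigma ** 2))
            for s, a in zip(window, align)}

# Toy usage: zero parameters give sigmoid(0) = 0.5, so p_t = S / 2.
p_t = predict_position([0.3, -0.2], [[0.0, 0.0]], [0.0], 20)
weights = local_p_weights([0.0] * 20, p_t, D=2)
```

Setting `p_t = t` and dropping the Gaussian factor would give the monotonic variant (local-m) under the same assumptions.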
Figure 4: Input-feeding approach - Attentional vectors $\tilde{h}_t$ are fed as inputs to the next time steps to inform the model about past alignment decisions.


Comparison to (Gregor et al., 2015) - Gregor et al. (2015) have proposed a selective attention mechanism, very similar to our local attention, for the image generation task. Their approach allows the model to select an image patch of varying location and zoom. We, instead, use the same “zoom” for all target positions, which greatly simplifies the formulation and still achieves good performance.

3.3 Input-feeding Approach 

In our proposed global and local approaches, the attentional decisions are made independently, which is suboptimal. In standard MT, by contrast, a coverage set is often maintained during the translation process to keep track of which source words have been translated. Likewise, in attentional NMTs, alignment decisions should be made jointly, taking into account past alignment information. To address that, we propose an input-feeding approach in which attentional vectors $\tilde{h}_t$ are concatenated with inputs at the next time steps as illustrated in Figure 4. The effects of having such connections are two-fold: (a) we hope to make the model fully aware of previous alignment choices and (b) we create a very deep network spanning both horizontally and vertically.
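The input-feeding connection amounts to concatenating the previous attentional vector with the current input before it enters the first recurrent layer. The sketch below shows only this wiring. `step` is a hypothetical stand-in for the stacked LSTM plus attention layer, and feeding a zero vector at the first time step is an assumption, since the text does not specify the initial value.

```python
def decode_with_input_feeding(embeddings, step, n):
    """Run a decoder whose input at each step is [embedding; previous h~].

    embeddings: list of n-dimensional input word embeddings.
    step: maps a 2n-dimensional input to an n-dimensional attentional vector.
    """
    h_tilde = [0.0] * n                  # assumed initial attentional vector
    outputs = []
    for emb in embeddings:
        x = list(emb) + h_tilde          # concatenation, length 2n
        h_tilde = step(x)
        outputs.append(h_tilde)
    return outputs

def toy_step(x):
    """Toy stand-in: average the embedding half and the fed-back half."""
    n = len(x) // 2
    return [(x[k] + x[n + k]) / 2.0 for k in range(n)]

outs = decode_with_input_feeding([[1.0, 2.0], [3.0, 4.0]], toy_step, n=2)
print(outs)
```

Because each output is fed back, the second step's result depends on the first step's attentional vector, which is the mechanism by which past alignment decisions can influence later ones.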

Comparison to other work - Bahdanau et al. (2015) use context vectors, similar to our $c_t$, in building subsequent hidden states, which can also achieve the “coverage” effect. However, there has not been any analysis of whether such connections are useful as done in this work. Also, our approach is more general; as illustrated in Figure 4, it can be applied to general stacking recurrent architectures, including non-attentional models.

If $n$ is the number of LSTM cells, the input size of the first LSTM layer is $2n$; those of subsequent layers are $n$.

Xu et al. (2015) propose a doubly attentional approach with an additional constraint added to the training objective to make sure the model pays equal attention to all parts of the image during the caption generation process. Such a constraint can also be useful to capture the coverage set effect in NMT that we mentioned earlier. However, we chose to use the input-feeding approach since it provides flexibility for the model to decide on any attentional constraints it deems suitable.


4 Experiments 


We evaluate the effectiveness of our models on the WMT translation tasks between English and German in both directions. newstest2013 (3000 sentences) is used as a development set to select our hyperparameters. Translation performances are reported in case-sensitive BLEU (Papineni et al., 2002) on newstest2014 (2737 sentences) and newstest2015 (2169 sentences). Following (Luong et al., 2015), we report translation quality using two types of BLEU: (a) tokenized BLEU to be comparable with existing NMT work and (b) NIST BLEU to be comparable with WMT results.


All texts are tokenized with tokenizer.perl and BLEU scores are computed with multi-bleu.perl.

With the mteval-v13a script as per WMT guideline.

| System | Ppl | BLEU |
|---|---|---|
| Winning WMT’14 system - phrase-based + large LM (Buck et al., 2014) | | 20.7 |
| Existing NMT systems | | |
| RNNsearch (Jean et al., 2015) | | 16.5 |
| RNNsearch + unk replace (Jean et al., 2015) | | 19.0 |
| RNNsearch + unk replace + large vocab + ensemble 8 models (Jean et al., 2015) | | 21.6 |
| Our NMT systems | | |
| Base | 10.6 | 11.3 |
| Base + reverse | 9.9 | 12.6 (+1.3) |
| Base + reverse + dropout | 8.1 | 14.0 (+1.4) |
| Base + reverse + dropout + global attention (location) | 7.3 | 16.8 (+2.8) |
| Base + reverse + dropout + global attention (location) + feed input | 6.4 | 18.1 (+1.3) |
| Base + reverse + dropout + local-p attention (general) + feed input | 5.9 | 19.0 (+0.9) |
| Base + reverse + dropout + local-p attention (general) + feed input + unk replace | | 20.9 (+1.9) |
| Ensemble 8 models + unk replace | | 23.0 (+2.1) |

Table 1: WMT’14 English-German results - shown are the perplexities (ppl) and the tokenized BLEU scores of various systems on newstest2014. We highlight the best system in bold and give progressive improvements in italic between consecutive systems. local-p refers to the local attention with predictive alignments. We indicate for each attention model the alignment score function used in parentheses.

4.1 Training Details

All our models are trained on the WMT’14 training data consisting of 4.5M sentence pairs (116M English words, 110M German words). Similar to (Jean et al., 2015), we limit our vocabularies to be the top 50K most frequent words for both languages. Words not in these shortlisted vocabularies are converted into a universal token <unk>.

When training our NMT systems, following (Bahdanau et al., 2015; Jean et al., 2015), we filter out sentence pairs whose lengths exceed 50 words and shuffle mini-batches as we proceed. Our stacking LSTM models have 4 layers, each with 1000 cells, and 1000-dimensional embeddings. We follow (Sutskever et al., 2014; Luong et al., 2015) in training NMT with similar settings: (a) our parameters are uniformly initialized in [-0.1, 0.1], (b) we train for 10 epochs using plain SGD, (c) a simple learning rate schedule is employed - we start with a learning rate of 1; after 5 epochs, we begin to halve the learning rate every epoch, (d) our mini-batch size is 128, and (e) the normalized gradient is rescaled whenever its norm exceeds 5. Additionally, we also use dropout with probability 0.2 for our LSTMs as suggested by (Zaremba et al., 2015). For dropout models, we train for 12 epochs and start halving the learning rate after 8 epochs. For local attention models, we empirically set the window size D = 10.
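The learning rate schedule and gradient rescaling described in the training settings can be sketched as follows. The exact epoch at which halving first takes effect is an interpretation (here the first halved epoch is the one right after `halve_after`), the gradient is assumed to be already normalized by the mini-batch size, and the function names are hypothetical.

```python
import math

def learning_rate(epoch, halve_after=5, base=1.0):
    """Learning rate for a 1-based epoch: constant for the first
    `halve_after` epochs, then halved at every following epoch."""
    return base * 0.5 ** max(0, epoch - halve_after)

def rescale_gradient(grad, max_norm=5.0):
    """Rescale the gradient so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]

# Non-dropout models: 10 epochs, halving starts after epoch 5.
schedule = [learning_rate(e) for e in range(1, 11)]
# Dropout models: 12 epochs, halving starts after epoch 8.
dropout_schedule = [learning_rate(e, halve_after=8) for e in range(1, 13)]
```

Under this reading, the non-dropout schedule ends at a learning rate of 1/32 and the dropout schedule at 1/16.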

Our code is implemented in MATLAB. When running on a single GPU device Tesla K40, we achieve a speed of 1K target words per second. It takes 7-10 days to completely train a model.

4.2 English-German Results 

We compare our NMT systems in the English-German task with various other systems. These include the winning system in WMT’14 (Buck et al., 2014), a phrase-based system whose language models were trained on a huge monolingual text, the Common Crawl corpus. For end-to-end NMT systems, to the best of our knowledge, (Jean et al., 2015) is the only work experimenting with this language pair and currently the SOTA system. We only present results for some of our attention models and will later analyze the rest in Section 5.

As shown in Table 1, we achieve progressive improvements when (a) reversing the source sentence, +1.3 BLEU, as proposed in (Sutskever et al., 2014) and (b) using dropout, +1.4 BLEU. On top of that, (c) the global attention approach gives a significant boost of +2.8 BLEU, making our model slightly better than the base attentional system of Bahdanau et al. (2015) (row RNNsearch). When (d) using the input-feeding approach, we seize another notable gain of +1.3 BLEU and outperform their system. The local attention model with predictive alignments (row local-p) proves to be even better, giving us a further improvement of +0.9 BLEU on top of the global attention model. It is interesting to observe the trend previously reported in (Luong et al., 2015) that perplexity strongly correlates with translation quality. In total, we achieve a significant gain of 5.0 BLEU points over the non-attentional baseline, which already includes known techniques such as source reversing and dropout.

The unknown replacement technique proposed in (Luong et al., 2015; Jean et al., 2015) yields another nice gain of +1.9 BLEU, demonstrating that our attentional models do learn useful alignments for unknown words. Finally, by ensembling 8 different models of various settings, e.g., using different attention approaches, with and without dropout etc., we were able to achieve a new SOTA result of 23.0 BLEU, outperforming the existing best system (Jean et al., 2015) by +1.4 BLEU.


| System | BLEU |
|---|---|
| Top - NMT + 5-gram rerank (Montreal) | 24.9 |
| Our ensemble 8 models + unk replace | 25.9 |

Table 2: WMT’15 English-German results - NIST BLEU scores of the winning entry in WMT’15 and our best one on newstest2015.


Latest results in WMT’15 - despite the fact that our models were trained on WMT’14 with slightly less data, we test them on newstest2015 to demonstrate that they can generalize well to different test sets. As shown in Table 2, our best system establishes a new SOTA performance of 25.9 BLEU, outperforming the existing best system backed by NMT and a 5-gram LM reranker by +1.0 BLEU.

4.3 German-English Results

We carry out a similar set of experiments for the WMT’15 translation task from German to English. While our systems have not yet matched the performance of the SOTA system, we nevertheless show the effectiveness of our approaches with large and progressive gains in terms of BLEU as illustrated in Table 3. The attentional mechanism gives us +2.2 BLEU gain and on top of that, we obtain another boost of up to +1.0 BLEU from the input-feeding approach. Using a better alignment function, the content-based dot product one, together with dropout yields another gain of +2.7 BLEU. Lastly, when applying the unknown word replacement technique, we seize an additional +2.1 BLEU, demonstrating the usefulness of attention in aligning rare words.

| System | Ppl. | BLEU |
|---|---|---|
| WMT’15 systems | | |
| SOTA - phrase-based (Edinburgh) | | 29.2 |
| NMT + 5-gram rerank (MILA) | | 27.6 |
| Our NMT systems | | |
| Base (reverse) | 14.3 | 16.9 |
| + global (location) | 12.7 | 19.1 (+2.2) |
| + global (location) + feed | 10.9 | 20.1 (+1.0) |
| + global (dot) + drop + feed | 9.7 | 22.8 (+2.7) |
| + global (dot) + drop + feed + unk | | 24.9 (+2.1) |

Table 3: WMT’15 German-English results - performances of various systems (similar to Table 1). The base system already includes source reversing on which we add global attention, dropout, input feeding, and unk replacement.

5 Analysis

We conduct extensive analysis to better understand our models in terms of learning, the ability to handle long sentences, choices of attentional architectures, and alignment quality. All results reported here are on English-German newstest2014.



5.1 Learning curves

We compare models built on top of one another as listed in Table 1. It is pleasant to observe in Figure 5 a clear separation between non-attentional and attentional models. The input-feeding approach and the local attention model also demonstrate their abilities in driving the test costs lower. The non-attentional model with dropout (the blue + curve) learns slower than other non-dropout models, but as time goes by, it becomes more robust in terms of minimizing test errors.

Figure 5: Learning curves - test cost (ln perplexity) on newstest2014 for English-German NMTs as training progresses (x-axis: mini-batches).

5.2 Effects of Translating Long Sentences

We follow (Bahdanau et al., 2015) to group sentences of similar lengths together and compute a BLEU score per group. Figure 6 shows that our attentional models are more effective than the non-attentional one in handling long sentences: the quality does not degrade as sentences become longer. Our best model (the blue + curve) outperforms all other systems in all length buckets.


5.3 Choices of Attentional Architectures 

We examine different attention models (global, local-m, local-p) and different alignment functions (location, dot, general, concat) as described in Section 3. Due to limited resources, we cannot run all the possible combinations. However, results in Table 4 do give us some idea about different choices. The location-based function does not learn good alignments: the global (location) model can only obtain a small gain when performing unknown word replacement compared to using other alignment functions. For content-based functions, our implementation of concat does not yield good performances and more analysis should be done to understand the reason. It is interesting to observe that dot works well for the global attention and general is better for the local attention. Among the different models, the local attention model with predictive alignments (local-p) is best, both in terms of perplexities and BLEU.

Figure 6: Length Analysis - translation qualities of different systems as sentences become longer.

| System | Ppl | BLEU (before unk) | BLEU (after unk) |
|---|---|---|---|
| global (location) | 6.4 | 18.1 | 19.3 (+1.2) |
| global (dot) | 6.1 | 18.6 | 20.5 (+1.9) |
| global (general) | 6.1 | 17.3 | 19.1 (+1.8) |
| local-m (dot) | >7.0 | x | x |
| local-m (general) | 6.2 | 18.6 | 20.4 (+1.8) |
| local-p (dot) | 6.6 | 18.0 | 19.6 (+1.9) |
| local-p (general) | 5.9 | 19 | 20.9 (+1.9) |

Table 4: Attentional Architectures - performances of different attentional models. We trained two local-m (dot) models; both have ppl > 7.0.


There is a subtle difference in how we retrieve alignments for the different alignment functions. At time step $t$ in which we receive $y_{t-1}$ as input and then compute $h_t$, $a_t$, $c_t$, and $\tilde{h}_t$ before predicting $y_t$, the alignment vector $a_t$ is used as alignment weights for (a) the predicted word $y_t$ in the location-based alignment functions and (b) the input word $y_{t-1}$ in the content-based functions.

With concat, the perplexities achieved by different models are 6.7 (global), 7.1 (local-m), and 7.1 (local-p). Such high perplexities could be due to the fact that we simplify the matrix $W_a$ to set the part that corresponds to $\bar{h}_s$ to identity.

5.4 Alignment Quality

A by-product of attentional models is word alignments. While (Bahdanau et al., 2015) visualized alignments for some sample sentences and observed gains in translation quality as an indication of a working attention model, no work has assessed the alignments learned as a whole. In contrast, we set out to evaluate the alignment quality using the alignment error rate (AER) metric.

Given the gold alignment data provided by RWTH for 508 English-German Europarl sentences, we “force” decode our attentional models to produce translations that match the references. We extract only one-to-one alignments by selecting the source word with the highest alignment weight per target word. Nevertheless, as shown in Table 6, we were able to achieve AER scores comparable to the one-to-many alignments obtained by the Berkeley aligner (Liang et al., 2006).

| Method | AER |
|---|---|
| global (location) | 0.39 |
| local-m (general) | 0.34 |
| local-p (general) | 0.36 |
| ensemble | 0.34 |
| Berkeley Aligner | 0.32 |

Table 6: AER scores - results of various models on the RWTH English-German alignment data.

We also found that the alignments produced by local attention models achieve lower AERs than those of the global one. The AER obtained by the ensemble, while good, is not better than the local-m AER, suggesting the well-known observation that AER and translation scores are not well correlated (Fraser and Marcu, 2007). We show some alignment visualizations in Appendix A.
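The one-to-one alignment extraction and the AER computation can be sketched as follows. The text does not spell out the AER formula, so this sketch assumes the commonly used definition with sure links S and possible links P, AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where S is treated as a subset of P. The (source index, target index) pair convention and the function names are illustrative choices.

```python
def extract_alignments(attention):
    """Pick the source position with the highest weight for each target word.

    attention[t][s] is the alignment weight a_t(s).
    Returns a set of (source_index, target_index) links.
    """
    links = set()
    for t, row in enumerate(attention):
        best_s = max(range(len(row)), key=row.__getitem__)
        links.add((best_s, t))
    return links

def aer(predicted, sure, possible):
    """Alignment error rate; sure links are also counted as possible."""
    possible = possible | sure
    matched = len(predicted & sure) + len(predicted & possible)
    return 1.0 - matched / (len(predicted) + len(sure))

# Toy usage: two target words over three source words.
attention = [[0.7, 0.2, 0.1],
             [0.1, 0.1, 0.8]]
predicted = extract_alignments(attention)   # {(0, 0), (2, 1)}
error = aer(predicted, sure={(0, 0), (1, 1)}, possible={(2, 1)})
```

A lower value is better; a prediction that exactly matches the sure links gives an AER of 0 under this definition.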


5.5 Sample Translations 

We show in Table 5 sample translations in both directions. It is appealing to observe the effect of attentional models in correctly translating names such as “Miranda Kerr” and “Roger Dow”. Non-attentional models, while producing sensible names from a language model perspective, lack the direct connections from the source side to make correct translations. We also observed an interesting case in the second example, which requires translating the doubly-negated phrase, “not incompatible”. The attentional model correctly produces “nicht ... unvereinbar”; whereas the non-attentional model generates “nicht verein-

*®We concatenate the 508 sentence pairs with IM sentence 
pairs from WMT and run the Berkeley aligner. 

























English-German translations 


src 

Orlando Bloom and Miranda Kerr still love each other 

ref 

Orlando Bloom und Miranda Kerr lieben sich noch immer 

best 

Orlando Bloom und Miranda Kerr lieben einander noch immer . 

base 

Orlando Bloom und Lucas Miranda lieben einander noch immer . 

src 

" We ' re pleased the FAA recognizes that an enjoyable passenger experience is not incompatible 
with safety and security , " said Roger Dow , CEO of the U.S. Travel Association . 

ref 

“ Wir freuen uns , dass die FAA erkennt, dass ein angenehmes Passagiererlebnis nicht im Wider¬ 
spruch zur Sicherheit steht ” , sagte Roger Dow , CEO der U.S. Travel Association . 

best 

" Wir freuen uns , dass die FAA anerkennt , dass ein angenehmes ist nicht mit Sicherheit und 
Sicherheit unvereinbar ist " , sagte Roger Dow , CEO der US - die . 

base 

" Wir freuen uns iiber die <unk> , dass ein <unk> <unk> mit Sicherheit nicht vereinbar ist mit 
Sicherheit und Sicherheit " , sagte Roger Cameron , CEO der US - <unk> . 


German-English translations 


src 

In einem Interview sagte Bloom jedoch , dass er und Kerr sich noch immer lieben . 

ref 

However, in an interview , Bloom has said that he and Kerr still love each other . 

best 

In an interview , however , Bloom said that he and Kerr still love . 

base 

However, in an interview , Bloom said that he and Tina were still <unk> . 

src 

Wegen der von Berlin und der Europaischen Zentralbank verhangten strengen Spaipolitik in 
Verbindung mit der Zwangsjacke , in die die jeweilige nationale Wirtschaft durch das Festhal- 
ten an der gemeinsamen Wahrung genotigt wird , sind viele Menschen der Ansicht, das Projekt 
Europa sei zu weit gegangen 

ref 

The austerity imposed by Berlin and the European Central Bank , coupled with the straitjacket 
imposed on national economies through adherence to the common currency , has led many people 
to think Project Europe has gone too far . 

best 

Because of the strict austerity measures imposed by Berlin and the European Central Bank in 
connection with the straitjacket in which the respective national economy is forced to adhere to 
the common cuiTency , many people believe that the European project has gone too far . 

base 

Because of the pressure imposed by tbe European Central Bank and tbe Federal Central Bank 
with the strict austerity imposed on the national economy in the face of the single currency , 
many people believe that the European project has gone too far . 


Table 5: Sample translations - for each example, we show the source (src), the human translation {ref), 
the translation from our best model {best), and the translation of a non-attentional model {base). We 
italicize some correct translation segments and highlight a few wrong ones in bold. 


bar”, meaning “not compatible” 0 The attentional 
model also demonstrates its superiority in translat¬ 
ing long sentences as in the last example. 

6 Conclusion 

In this paper, we propose two simple and effective 
attentional mechanisms for neural machine trans¬ 
lation: the global approach which always looks 
at all source positions and the local one that only 
attends to a subset of source positions at a time. 
We test the effectiveness of our models in the 
WMT translation tasks between English and Ger¬ 
man in both directions. Our local attention yields 
large gains of up to 5.0 BLEU over non-attentional 

'^The reference uses a more fancy translation of “incom¬ 
patible”, which is “im Widerspruch zu etwas stehen”. Both 
models, however, failed to translate “passenger experience”. 


models which already incorporate known tech¬ 
niques such as dropout. Eor the English to Ger¬ 
man translation direction, our ensemble model has 
established new state-of-the-art results for both 
WMT’14 and WMT’15, outperforming existing 
best systems, backed by NMT models and n-gram 
EM rerankers, by more than 1.0 BLEU. 

We have compared various alignment functions and shed light on which functions are best for which attentional models. Our analysis shows that attention-based NMT models are superior to non-attentional ones in many cases, for example in translating names and handling long sentences.
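
To make the global/local contrast summarized above concrete, here is a minimal sketch. It assumes the Gaussian-weighted window of the predictive local model (window [p_t - D, p_t + D], standard deviation D/2) described earlier in the paper. The function names, the raw scores, and the values of p_t and D are illustrative only, and in a real model the scores would come from an alignment function over hidden states.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of raw alignment scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_weights(scores: List[float]) -> List[float]:
    """Global attention: every source position receives some weight."""
    return softmax(scores)


def local_weights(scores: List[float], p_t: float, D: int) -> List[float]:
    """Local attention: weights only inside the window [p_t - D, p_t + D]."""
    lo = max(0, math.ceil(p_t - D))
    hi = min(len(scores) - 1, math.floor(p_t + D))
    window = softmax(scores[lo:hi + 1])      # normalize within the window only
    sigma = D / 2.0                          # assumed setting: sigma = D / 2
    weights = [0.0] * len(scores)
    for offset, w in enumerate(window):
        s = lo + offset
        # Damp each in-window weight by a Gaussian centred at p_t.
        weights[s] = w * math.exp(-((s - p_t) ** 2) / (2.0 * sigma ** 2))
    return weights


scores = [0.1, 0.4, 2.0, 1.5, 0.3, 0.2, 0.0]  # made-up scores for 7 source words
glob = global_weights(scores)
loc = local_weights(scores, p_t=2.3, D=2)
```

In this sketch the global weights sum to one over all seven positions, while the local weights are nonzero only at positions 1 to 4 and, because of the Gaussian damping, sum to less than one.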
A Alignment Visualization


We visualize the alignment weights produced by our different attention models in Figure 7. The visualization of the local attention model is much sharper than that of the global one. This contrast matches our expectation, since local attention is designed to focus on only a subset of words each time. Also, since we translate from English to German and reverse the source English sentence, the white stripes at words such as “reality” in the global attention model reveal an interesting access pattern: the model tends to refer back to the beginning of the source sequence.

Compared to the alignment visualizations in Bahdanau et al. (2015), our alignment patterns are not as sharp. This difference could be due to the fact that translating from English to German is harder than translating into French, as done in Bahdanau et al. (2015), which is an interesting point to examine in future work.







Figure 7: Alignment visualizations - shown are images of the attention weights learned by various models: (top left) global, (top right) local-m, and (bottom left) local-p. The gold alignments are displayed at the bottom right corner.















